ABSTRACT


With the advent of Unicode encoding, Punjabi-language content,
written in both the Gurmukhi and Shahmukhi scripts, is increasing
day by day on the internet. Processing textual information involves
passing it through various pre-processing phases, and stop-word
elimination is one such sub-phase. 256 Gurmukhi stop words were
identified from poetry, stories and online material and passed to a
Punjabi stemmer. Stemming produced 184 stemmed stop words, which
were then passed to a transliteration phase to generate the
corresponding stop words in Shahmukhi script. For the first time in
the scientific community dealing with computational linguistics and
literature processing using NLP techniques, this list of 184
Punjabi stop words is released for public use and further NLP
applications. The list gives each stop word in its Gurmukhi,
Shahmukhi and Roman-script forms.


Categories and Subject Descriptors 

* Computing methodologies~Natural language processing
* Computing methodologies~Artificial intelligence
* Computing methodologies~Language resources


General Terms 
Algorithms, Design, Human Factors, Languages. 


Keywords 


Gurmukhi, Natural Language Processing, Stop word, Shahmukhi, 
Punjabi. 


1. INTRODUCTION 


Pre-processing plays an important role in the text mining area of
computer science [8]. Before data can be mined for useful
information, it must be pre-processed. Text is pre-processed mainly
to extract useful features from it. Pre-processing steps include
noise removal, special symbol removal and stop word removal.

Stop words are words that have no significant semantic relation to
the context in which they occur [8]. They are extremely common
words that appear in most of the documents in a given collection
and provide no useful information for selecting documents. They
therefore should not be used as indexing terms and are eliminated
from the text for two reasons: first, doing so reduces the feature
space of words, and second, it increases classifier accuracy.

Removal of stop words is needed in many natural language
processing applications such as classification, segmentation,
spelling normalization and stemming. Eliminating such words from
consideration early in automatic indexing speeds up processing,
saves a large amount of index space, and does not damage
retrieval effectiveness.
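
The elimination step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the stop words are Roman transliterations taken from Table I, the input tokens are hypothetical, and a real system would match Gurmukhi or Shahmukhi tokens rather than Roman ones.

```python
# Illustrative stop-word filter. The entries are Roman transliterations from
# Table I; a real system would match Gurmukhi or Shahmukhi tokens instead.
STOP_WORDS = frozenset({"isa", "vica", "na", "huna", "nala", "bhi"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word set, preserving order."""
    return [tok for tok in tokens if tok not in stop_words]


# Hypothetical tokenized input; only the stop words come from Table I.
tokens = ["isa", "kitab", "vica", "kahani", "huna", "nahi"]
filtered = remove_stop_words(tokens)
print(filtered)  # ['kitab', 'kahani', 'nahi']
```

Storing the list as a set keeps each membership test constant-time on average, which matters when the filter runs over every token of a large collection.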


2. LITERATURE SURVEY 


Text classification is an active research area in information
retrieval and natural language processing. A fundamental resource
in text classification is a list of 'stop' words (a stop word
list), used to identify frequent words that are unlikely to assist
in classification and are therefore deleted during pre-processing.
To date, many stop word lists have been developed for various
foreign languages such as Chinese, Arabic and English. This
section reviews work on the identification of stop words in
foreign languages.

Lili H. and Lizhu H. [16] gave a refined definition of stop words
for Chinese text classification from the perspective of statistical
correlation, and then proposed an automatic approach to extracting
the stop word list based on the weighted Chi-squared statistic on a
2*p contingency table. They evaluated the resulting list using the
accuracy obtained in a text classification experiment. Yao and
Zenwen [19] constructed a Chinese-English stop word list containing
1289 words by merging the classical stop word list with stop words
that depend on the domains of the text document corpus. Savoy [18]
defined a general stop word list as the words that serve no purpose
for retrieval but are used very frequently in composing documents,
and established such a list for French. First, all the word forms
appearing in the French corpora were sorted by frequency of
occurrence and the 200 most frequent words were extracted. Second,
all numbers, plus all nouns and adjectives more or less directly
related to the main subjects of the underlying collections, were
removed. Third, some non-information-bearing words were added even
if they did not appear among the 200 most frequent words. The
resulting French general stop word list contains 215 words; using
it reduced the size of the inverted file by about 21% for one test
collection and about 35% for the second corpus. Myerson [16] used
two statistical measures, document frequency and chi-square, to
identify stop words. The weighted Chi-squared statistic (χ²) was
used to measure the statistical correlation between a word and the
classification categories. The χ² values of the words are
calculated and sorted in increasing order, so the first word in the
ordered list has the minimum weighted Chi-squared value, i.e. it
has a higher document frequency and weaker correlation with all the
categories. A Chinese corpus of the Mayor's Public Access Line
Project texts was used to evaluate and compare the results of the
classifiers.
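
The first step of the frequency-based procedure described above for French (rank all word forms by frequency and keep the top N) can be sketched as follows. The function name, the whitespace tokenizer and the toy corpus are assumptions made for illustration; the later manual steps (removing numbers and topic-related nouns, adding non-information-bearing words) are not modeled.

```python
from collections import Counter


def top_frequent_words(documents, n=200):
    """Return the n most frequent word forms across whitespace-tokenized documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return [word for word, _ in counts.most_common(n)]


# Toy corpus (hypothetical); a real run would use the full collection.
corpus = [
    "le chat et le chien",
    "le livre et la table",
    "la table et le chat",
]
candidates = top_frequent_words(corpus, n=2)
print(candidates)  # ['le', 'et']
```

The output is only a candidate list; as the procedure above makes clear, frequency alone also surfaces topic words, which is why a manual filtering pass follows.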


Zheng and Gaowa [20] proposed a method for constructing a stop-word
list for the Mongolian language based on entropy calculation. The
first step is to determine an initial stop word list: the entropy
of every word is calculated and the words are sorted in ascending
order of entropy. The second step is to combine the results with
Mongolian part-of-speech information to produce the final stop-word
list. Zou et al. [21] used an aggregated model that measures both
the word frequency characteristic, through a statistical model, and
the information characteristic, through an information model. The
generated list was compared with other existing lists and showed an
improvement over them. Elkhair [7] conducted a comparative study of
the effect of stop word elimination on Arabic IR, comparing three
stop lists: a general stop list, a corpus-based stop list and a
combined stop list. Alhadidi and Alwedyan [1] implemented a hybrid
stop-word removal technique for the Arabic language based on a
dictionary and an algorithm. The technique was tested on a set of
242 Arabic abstracts chosen from the Proceedings of the Saudi
Arabian National Computer Conferences and on another data set
chosen from the Jordanian Alrai newspaper.
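
To make the entropy-based ranking mentioned above concrete, the sketch below computes a per-word entropy and sorts the words in ascending order. The text does not give the exact formula used in [20], so this assumes the Shannon entropy of each word's occurrence distribution over documents; the function name and the toy tokens are likewise hypothetical.

```python
import math
from collections import Counter


def word_entropy(documents):
    """Shannon entropy (bits) of each word's occurrence distribution over documents."""
    per_doc = [Counter(doc.split()) for doc in documents]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    entropy = {}
    for word, total in totals.items():
        h = 0.0
        for counts in per_doc:
            if counts[word]:
                p = counts[word] / total  # share of this word's occurrences in this document
                h -= p * math.log2(p)
        entropy[word] = h
    return entropy


docs = ["a b a", "a c", "a d"]  # hypothetical tokens
scores = word_entropy(docs)
ranked = sorted(scores, key=scores.get)  # ascending order of entropy
print(ranked[-1], scores[ranked[-1]])
```

Under this assumption, a word confined to one document scores 0, while a word spread across many documents scores high, which is the behaviour expected of stop-word candidates.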

Saini [17] used a stop word list consisting of five categories of
stop words, namely generic stop words, HTML stop words, noise
stop words, domain stop words and misspelling stop words. The
researcher used these stop words for processing unstructured
English documents written in Roman script.


3. STOP WORDS FOR SHAHMUKHI 
SCRIPTED PUNJABI LANGUAGE 
DOCUMENTS 


India is a multilingual country where a large number of languages
are spoken in day-to-day life. The dominant language families are
Indo-Aryan (spoken in the north-western region) and Dravidian
(spoken in the southern region). Sino-Tibetan is a minority
language family (spoken in the eastern region). The Indo-Aryan
family mainly consists of Hindi, Gujarati, Bengali, Punjabi,
Marathi, Urdu and Sanskrit. The Dravidian family mainly consists of
Telugu, Tamil and Kannada, while the Sino-Tibetan family consists
mainly of Manipuri, Meithei and Himalayish languages [14].
Punjabi is an Indo-Aryan language spoken by 102 million native
speakers worldwide, making it the 10th most widely spoken
language. Punjabi is the most widely spoken first language in
Pakistan, the eleventh most widely spoken language in India, and
the third most spoken native language in the Indian subcontinent.
It is the fourth most spoken language in the United Kingdom and
the third most spoken native language in Canada. Punjabi is
written in two scripts: Gurmukhi and Shahmukhi. The word Gurmukhi
translates as "Guru's mouth", and Shahmukhi means "from the King's
mouth". In the Punjab province of Pakistan the script used is
Shahmukhi, which differs from the Urdu alphabet in having four
additional letters. In the Indian state of Punjab, the Gurmukhi
script is generally used for writing Punjabi [3].

For Punjabi written in Gurmukhi script, 256 stop words were
identified from poetry, stories and other online material [15].
Initially, 175 stop words were identified from stories and news
articles available online, and 165 stop words were identified from
poems collected in the different categories discussed by Kaur J
and Saini JR [15]. After taking the union of the two lists, 256
unique stop words were obtained from the poems and news articles.
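
The counts above imply the size of the overlap between the two source lists: 175 + 165 - 256 = 84 stop words occur in both. The sketch below reproduces that arithmetic with a set union; the identifiers `w0`, `w1`, ... are placeholders standing in for the actual words, and the variable names are assumptions.

```python
# Placeholder identifiers stand in for the actual stop words.
prose_stop_words = {f"w{i}" for i in range(175)}       # 175 from stories and news
poetry_stop_words = {f"w{i}" for i in range(91, 256)}  # 165 from poems

merged = prose_stop_words | poetry_stop_words  # union removes duplicates
shared = prose_stop_words & poetry_stop_words  # words found in both sources
print(len(merged), len(shared))  # 256 84
```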

These identified stop words were stemmed to convert them to their
root forms. Stemming is a way of converting a written word into its
root form [4]. Gupta V. [9] developed different rules for stemming
verbs, adverbs and pronouns. For example, the Punjabi word
[kudiya] 'girls' is converted to its root [kudiy] 'girl'. These
stemming rules were applied manually to the 256 stop words
identified from poetry and other Punjabi documents, which yielded
186 unique stop words. On the basis of Punjabi grammar and
part-of-speech (POS) word classes, these 186 stemmed stop words
were categorized into four word classes, Adverbs [6], Verbs [6],
Pronouns [6] and Conjunctions [2], plus a miscellaneous category
for any word that does not fit the first four. Of the 186 stemmed
stop words, 99 are adverb forms, 40 are verbs, 26 are pronouns and
7 are conjunctions; the remaining 14 are assigned to the
miscellaneous category [13].
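
The reduction in list size after stemming comes from several surface forms collapsing onto one root. The sketch below illustrates this with the Roman-script example given above. The single suffix rule is a hypothetical stand-in chosen only to reproduce that example; it is not one of the rules of [9], and "ghar" is an arbitrary extra token.

```python
# Hypothetical suffix list, chosen only to reproduce the kudiya -> kudiy example.
SUFFIXES = ("a",)


def stem(word, suffixes=SUFFIXES):
    """Strip the longest matching suffix, keeping a stem of at least two characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


words = ["kudiya", "kudiy", "ghar"]  # Roman-script tokens for illustration
unique_stems = sorted({stem(w) for w in words})
print(unique_stems)  # ['ghar', 'kudiy']
```

Three input forms yield two stems here, which is the same effect, on a small scale, as the drop from 256 words to the stemmed list.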

All of this work was done for Punjabi written in Gurmukhi script.
As explained earlier, in Pakistan Punjabi is written in Shahmukhi
script. Since no stop word list is available for Punjabi written
in Shahmukhi script, the 184 stemmed Gurmukhi stop words were
transliterated to generate stop words in Shahmukhi script.
Transliteration is the conversion of text from one script into
another [5]. A Gurmukhi to Shahmukhi transliteration system has
been designed by Punjabi University, Patiala and is available
online [11]. The list of transliterated Shahmukhi stop words is
presented in Table I. The table gives each word in Gurmukhi
script, followed by its transliterated form in Shahmukhi script
and then its transliteration and translation in Roman script.


Table I. List of stop words in Gurmukhi, Shahmukhi and Roman script

[The Gurmukhi and Shahmukhi columns are not legible in this copy; the Roman transliteration and meaning of each word are listed.]

S. No. | Roman | Meaning | S. No. | Roman | Meaning
1 | [isa] | this | 2 | [jisa] | who
3 | [vica] | in the | 4 | [na] | no
5 | [taka] | up | 6 | [huna] | now
7 | [vi] | too | 8 | [jinam] | whom
9 | [othon] | upon | 10 | [nala] | with
11 | [nahiria] | no | 12 | [cahé] | either
13 | [bhi] | too | 14 | [kisa] | what
15 | [valor] | by | 16 | [pichom] | after
17 | [iha] | this | 18 | [edhara] | around
19 | [iha] | this | 20 | [nil] | to
21 | [jadérn] | when, while | 22 | [ajihé] | such
23 | [ka] | many | 24 | [hi] | only
25 | [tada] | then | 26 | [ké] | by
27 | [andar] | within | 28 | [hain] | yes
29 | [uté] | upon | 30 | [bahuta] | much
31 | [sabuta] | complete | 32 | [kafi] | enough
33 | [kadi] | sometime | 34 | [huné] | now
35 | [ném] | the | 36 | [lai] | for
37 | [ji] | respect | 38 | [ki] | that
39 | [kisé] | someone | 40 | [magara] | behind
41 | [pira] | complete | 42 | [da] | of
43 | [né] | the | 44 | [tar'ham] | like
45 | [hové] | if | 46 | [phéra] | later
47 | [jékar] | just in case | 48 | [vélé] | times
49 | [dé] | of | 50 | [othé] | there
51 | [jéhara] | which | 52 | [kité] | somewhere
53 | [ba'ada] | after | 54 | [ithé] | here
55 | [sara] | all, whole | 56 | [jinhanu] | whom
57 | [cho] | out | 58 | [jad] | when
59 | [kadé] | never | 60 | [vanga] | like
61 | [sab] | all | 62 | [doraan] | during
63 | [tan] | when | 64 | [varaga] | like
65 | [ki] | that | 66 | [jo] | that
67 | [la] | to attach | 68 | [karké] | because
69 | [pura] | complete | 70 | [bilkul] | absolutely
71 | [naale] | also | 72 | [eho] | such
73 | [ton] | from | 74 | [kaun] | who
75 | [hona] | be | 76 | [pher] | then
77 | [paso] | from | 78 | [tad] | then
79 | [jeha] | little | 80 | [kolon] | from
81 | [és] | this | 82 | [kina] | how much
83 | [jina] | who | 84 | [jivé] | such as
85 | [kujh] | some | 86 | [hethan] | below
87 | [dobara] | by | 88 | [saré] | all
89 | [sada] | forever | 90 | [jithé] | where
91 | [ethé] | here | 92 | [koi] | someone
93 | [baré] | about | 94 | [ki] | what
95 | [kad] | when to | 96 | [je] | please
97 | [kadé] | never | 98 | [di'am] | of
99 | [hoye] | happen | 100 | [chala] | goes
101 | [rahé] | are | 102 | [lai] | take
103 | [bano] | become | 104 | [aakh] | say
105 | [déni] | give | 106 | [bana] | made
107 | [pia] | lying | 108 | [kara] | do
109 | [ho''a] | happened | 110 | [pain] | falling
111 | [ga'i] | gone | 112 | [keh] | say
113 | [laga] | seem | 114 | [chuké] | (illegible)
115 | [huda] | happen | 116 | [keha] | said
117 | [janda] | going | 118 | [karvayei] | conducted
119 | [vekha] | see | 120 | [banaye] | created
121 | [suna] | hear | 122 | [kitta] | carried out
123 | [a't] | occurred | 124 | [javan] | going
125 | [sakdé] | can | 126 | [dekh] | see
127 | [javé] | go | 128 | [adi] | so on
129 | [janda] | going | 130 | [li'a] | taken
131 | [karana] | doing | 132 | [a] | come
133 | [lagoda] | involving | 134 | [reha] | (illegible)
135 | [aavé] | arrives | 136 | [geya] | been
137 | [kari] | do | 138 | [otha] | arise
139 | [laeya] | attach | 140 | [rahi] | been
141 | [reh] | living | 142 | [usné] | he
143 | [uha] | he, she | 144 | [tusi] | you
145 | [sara] | was | 146 | [mera] | my
147 | [sabha] | all | 148 | [usdi] | his
149 | [hana] | are | 150 | [tera] | your
151 | [tu] | you | 152 | [us] | his
153 | [si] | was | 154 | [oyé] | person
155 | [ho] | are | 156 | [aap] | you
157 | [ténu] | you | 158 | [san] | was
159 | [tusa] | you | 160 | [mein] | I
161 | [hain] | are | 162 | [tusi] | you
163 | [hai] | is | 164 | [assi] | we
165 | [apna] | my | 166 | [par] | but
167 | [ie] | if | 168 | [té] | and
169 | [aaté] | and | 170 | [tan] | so
171 | [jan] | or | 172 | [bhavem] | although
173 | [kal] | total | 174 | [aagali] | next
175 | [vagaira] | etc. | 176 | [varg] | category
177 | [rakh] | put | 178 | [ama] | common
179 | [laag] | take | 180 | [la] | apply
181 | [gal] | thing | 182 | [hala] | condition
183 | [pi] | drink | 184 | [ek] | one

4. CONCLUSION


Stop words are functional and general words of a language that
usually do not contribute to the semantics of documents and have
no real added value. Removing such words improves classifier
efficiency. Punjabi can be written in two different scripts,
Gurmukhi and Shahmukhi. In this paper, 184 stemmed Gurmukhi stop
words are presented in their transliterated (Shahmukhi script) and
translated (Roman script) forms. The list presented here is
released for public use for NLP on Shahmukhi-scripted documents.